Abstract. We consider the following problem: given a set of gene family trees on a set of
genomes, find a first speciation, splitting these genomes into two subsets, that minimizes the number of
gene duplications that happened before this speciation. We call this problem the Minimum Duplication
Bipartition Problem. Using a generalization of the Minimum Edge-Cut Problem, known as Submodular
Function Minimization, we propose a polynomial time and space 3-approximation algorithm for the
Minimum Duplication Bipartition Problem.

1 Introduction 

Gene duplication is an evolutionary mechanism that played a major role in the evolution of
the genomes of important groups of eukaryotes such as vertebrates [3], insects [10], plants [14] or
fungi [18]. Gene duplications, together with gene losses, result in gene families that can contain
several copies of the same gene in a given genome. Recent progress in methods for reconstructing
phylogenetic trees for individual gene families, called gene trees, has resulted in large sets of accurate gene
trees for eukaryote species. Phylogenomics aims at reconstructing the evolution of species (genomes)
by inferring a species tree, for a set of genomes, from a set of gene trees. The Minimum Duplication
Problem asks to find, from a set of gene trees, a species tree that induces an evolutionary history
with a minimum number of gene duplications. It has been applied to several eukaryotic datasets
with success (see [14, 19] for example). The Minimum Duplication Problem is NP-hard [12], but it
can be solved by a fixed-parameter algorithm, based on a parameter relevant on true datasets [9],
and recent work on local-search heuristics has proved efficient for processing large datasets [1, 19].

Recently, in [5, 15], a formal link between the Minimum Duplication Problem and the problem
of reconstructing supertrees [2] has been introduced. In the supertree problem, given a set of gene
trees from orthologous genes (at most one member of each family is conserved in each considered
genome), the goal is to reconstruct a species tree that agrees with the maximum number of gene
trees. This problem is NP-hard too, even in the simple cases where each gene tree contains only
three leaves [4] or each gene tree contains a single internal vertex [5]. However, heuristics based
on the computation of successive minimum edge-cuts in a graph whose vertices are the considered
species have been widely used [16, 13]. In such heuristics, each minimum edge-cut, which splits the
set of considered species into two subsets, corresponds to a speciation that results in two lineages.
A complete species tree then results from a sequence of such speciations, each obtained from a
minimum edge-cut.

In the present work we follow this path, and we attack the following parsimony problem: given
a set of gene trees, find a bipartition of the considered genomes into two subsets, corresponding to
a speciation, that minimizes the number of duplications that happened before this speciation. We
call this problem the Minimum Duplication Bipartition Problem. Although it is a restricted version of
the more general Minimum Duplication Problem, it leads, as for supertrees, to a natural greedy
heuristic to reconstruct a species tree from a set of gene trees. Our main result is a polynomial time
and space 3-approximation algorithm for the Minimum Duplication Bipartition Problem. Our
algorithm relies on a well-known generalization of the Minimum Edge-Cut Problem, Submodular
Function Minimization [6].

We first define, in Section 2, gene trees, species trees and duplications, then the Minimum
Duplication Problem and the Minimum Duplication Bipartition Problem. In Section 3 we show how
the Minimum Duplication Bipartition Problem can be described both in terms of prefixes of the gene
trees and of a variant of the classical Minimum Edge-Cut Problem, namely the Minimum
Labeled-Edge-Cut Problem. This problem is NP-hard in general, but we show in Section 4 how a variant
can be solved via Submodular Function Minimization, which leads to an approximation algorithm
for the Minimum Duplication Bipartition Problem whose cost is at most twice the optimal plus one: if
an optimal first speciation implies $k$ gene duplications, our algorithm returns, in polynomial time
and space, a bipartition that implies at most $2k + 1$ duplications. In Section 5, we describe some
properties of the set of all optimal bipartitions.

2 Preliminaries

Gene and species trees. Let $\mathcal{G} = \{1, 2, \ldots, k\}$ be a set of integers representing $k$ different genomes
(species). A species tree on $\mathcal{G}$ is a tree with exactly $k$ leaves, where each $i \in \mathcal{G}$ is the label of a
single leaf. A tree is binary if every internal vertex has exactly two children. A gene tree on $\mathcal{G}$ is a
binary tree where each leaf is labeled by an integer from $\mathcal{G}$. A gene tree is a formal representation
of a phylogenetic tree of a gene family, where each leaf labeled $i$ represents a member (gene) of the
gene family located on genome $i$.

Given a vertex $x$ of a binary tree $T$, we denote by $L(x)$ (resp. $L(T)$) the subset of $\mathcal{G}$ defined
by the labels of the leaves of the subtree of $T$ rooted in $x$ (resp. the labels of the leaves of $T$). We
denote by $x_l$ and $x_r$ the two children of $x$ if $x$ is not a leaf.

The Minimum Duplication Problem. Given a gene tree $G$ and a, possibly non-binary, species tree
$S$ on $\mathcal{G}$, the LCA mapping $M$ maps vertices of $G$ to vertices of $S$ as follows: for a vertex $x$ of $G$,
$M(x) = v$ is the unique vertex of $S$ such that $L(x) \subseteq L(v)$ and either $v$ is a leaf of $S$ or $L(x)$ is not
included in the leaf set of any child of $v$. A vertex $x$ of $G$ is then a duplication with respect to
$S$ if $M(x) = M(x_r)$ and/or $M(x) = M(x_l)$; otherwise, $x$ is called a speciation (see Appendix for
more details on the evolutionary events implied by the LCA mapping). A duplication $x$ of $G$ is
said to precede the first speciation if $M(x) = r(S)$, the root of $S$. The same definitions apply to a
forest $F$ of gene trees on $\mathcal{G}$. The duplication cost of $F$ given $S$, denoted by $d(F, S)$, is the number of
vertices in $F$ that are duplications with respect to $S$. It is well known that $d(F, S)$ is the minimum
number of gene duplication events required in any evolution scenario that resulted in $F$ (see [7, 5]
and references therein), which leads to the following optimization problem.

Minimum Duplication Problem (MDP):
Input: A gene tree forest $F$ on $\mathcal{G}$;

Output: A binary species tree $S$ on $\mathcal{G}$ such that $d(F, S)$ is minimum.

The Minimum Duplication Problem with multiple gene trees is NP-complete [12], and there is no
known approximation algorithm for this problem, although a fixed-parameter tractable algorithm
has been proposed in [9].
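
The LCA mapping and the duplication cost defined above can be illustrated on toy inputs. The following is a minimal sketch, assuming trees are encoded as nested Python tuples with integer leaves (an encoding chosen here for illustration, not taken from the paper); the function names are hypothetical. It represents each species-tree vertex by its leaf set and takes $M(x)$ to be the smallest such set containing $L(x)$.

```python
def leaves(t):
    """Leaf-label set L(t) of a nested-tuple tree whose leaves are ints."""
    if isinstance(t, int):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))


def clusters(s):
    """Leaf sets L(v) of all vertices v of the species tree s."""
    out = [leaves(s)]
    if not isinstance(s, int):
        for child in s:
            out.extend(clusters(child))
    return out


def lca_map(label_set, species_clusters):
    """M(x): the smallest cluster of the species tree that contains L(x)."""
    return min((c for c in species_clusters if label_set.issubset(c)), key=len)


def duplication_cost(gene_tree, species_tree):
    """Count internal vertices x with M(x) = M(x_l) or M(x) = M(x_r)."""
    cl = clusters(species_tree)

    def count(x):
        if isinstance(x, int):
            return 0
        xl, xr = x
        m = lca_map(leaves(x), cl)
        is_dup = m in (lca_map(leaves(xl), cl), lca_map(leaves(xr), cl))
        return int(is_dup) + count(xl) + count(xr)

    return count(gene_tree)


species = ((1, 2), 3)
print(duplication_cost(((1, 2), 3), species))  # gene tree identical to the species tree
print(duplication_cost(((1, 3), 2), species))  # the root of the gene tree is a duplication
```

Because the species tree may be non-binary, the same function also accepts a bipartition such as `((1, 2, 3), 4)`.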

A restriction to the first speciation: the Minimum Duplication Bipartition Problem. A bipartition
$B$ on $\mathcal{G}$ is a species tree on $\mathcal{G}$ containing only three internal vertices, the root $v$ and its children $v_l$
and $v_r$, and such that $L(v_l) \cap L(v_r) = \emptyset$ ($v_l$ and $v_r$ are possibly non-binary vertices). From now on, we
always denote by $v$ the root of a bipartition $B$. As the way vertices of a gene tree forest $F$ were
labeled as duplication or speciation for a given species tree did not require this species tree to be
binary, these definitions apply without change to bipartitions, as does the notion of a duplication in $F$ preceding
the first speciation of $B$. We denote by $d_1(F, B)$ the number of duplications that precede the
first speciation with respect to a bipartition $B$. This leads to the following optimization problem,
which is a restriction of MDP where the parsimony assumption is restricted to the duplications that
precede the first speciation:

Minimum Duplication Bipartition Problem (MDBP):
Input: A gene tree forest $F$ on $\mathcal{G}$;

Output: A bipartition $B$ on $\mathcal{G}$ such that $d_1(F, B)$ is minimum.

The motivation for this problem follows a remark in [5] that MDP is in fact a slight variant of a
supertree problem (see also [15], which explores the link between gene duplications and supertrees),
and the fact that greedy heuristics for hard supertree problems based on computing successive
speciation events have proved to be effective [16, 2]. Indeed, given a bipartition $B$ for a forest $F$
of gene trees, removing all vertices $x$ such that $L(x)$ contains leaf labels both from $L(v_l)$ and $L(v_r)$
results in two sets of trees: one set of trees, $F_l$, with leaves that belong only to $L(v_l)$, and one,
$F_r$, with leaves from $L(v_r)$. These two forests of gene trees can then be considered independently,
similarly to $F$, which defines a greedy heuristic to compute a binary species tree (see [16, 13, 5]).
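
The splitting step of this greedy heuristic can be sketched as follows. This is only an illustration, assuming gene trees are encoded as nested Python tuples with integer leaves (an encoding chosen here, not the paper's); `split_forest` is a hypothetical name, and the side $L(v_l)$ is passed as a plain set.

```python
def leaves(t):
    """Leaf-label set L(t) of a nested-tuple tree whose leaves are ints."""
    if isinstance(t, int):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))


def split_forest(forest, left):
    """Drop every vertex whose leaf set meets both sides; return (F_l, F_r)."""
    left = frozenset(left)
    f_l, f_r = [], []

    def visit(x):
        lx = leaves(x)
        if lx.issubset(left):
            f_l.append(x)          # whole subtree lies in L(v_l)
        elif not (lx & left):
            f_r.append(x)          # whole subtree lies in L(v_r)
        else:
            for child in x:        # x straddles the bipartition: remove it
                visit(child)

    for tree in forest:
        visit(tree)
    return f_l, f_r


forest = [((1, 2), (3, 4)), ((1, 3), 2)]
f_l, f_r = split_forest(forest, {1, 2})
print(f_l)
print(f_r)
```

In this sketch, isolated leaves are kept as trivial one-vertex trees; a real implementation would probably discard them before recursing on each side.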

We conclude these preliminaries with three obvious, but very useful, properties related to
duplication vertices of a forest of gene trees $F$.

Property 1. Let $F$ be a gene tree forest on $\mathcal{G}$ and $x$ a vertex of $F$.

1. If $L(x_l) \cap L(x_r) \neq \emptyset$, then $x$ is a duplication vertex with respect to any species tree $S$ (including
bipartitions). Such a vertex is called an apparent duplication.

2. Given a bipartition $B$ on $\mathcal{G}$, with root $v$, $x$ is a duplication with respect to $B$ that precedes the
first speciation if and only if there exists a pair $(s,t) \in L(v_l) \times L(v_r)$ such that $(s,t) \in L(x_l)^2$
or $(s,t) \in L(x_r)^2$.

3. Given a bipartition $B$ on $\mathcal{G}$, if $x$ is a duplication with respect to $B$ that precedes the first
speciation, then every ancestor of $x$ is too.
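
Property 1.2 gives a direct way to count the duplications preceding the first speciation: a vertex counts exactly when the leaf set of one of its children has labels on both sides of the bipartition. A minimal sketch, assuming nested-tuple gene trees with integer leaves and a bipartition given by one of its sides (all names are hypothetical):

```python
def leaves(t):
    """Leaf-label set L(t) of a nested-tuple tree whose leaves are ints."""
    if isinstance(t, int):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))


def d1(forest, left):
    """Number of vertices x such that L(x_l) or L(x_r) meets both sides."""
    left = frozenset(left)

    def is_split(label_set):
        return bool(label_set & left) and bool(label_set - left)

    def count(x):
        if isinstance(x, int):
            return 0
        xl, xr = x
        here = is_split(leaves(xl)) or is_split(leaves(xr))
        return int(here) + count(xl) + count(xr)

    return sum(count(tree) for tree in forest)


tree = ((1, 2), (3, 4))
print(d1([tree], {1, 2}))   # the root is a speciation for {1,2} | {3,4}
print(d1([tree], {1, 3}))   # the root precedes the first speciation for {1,3} | {2,4}
```

For the gene tree `((1, 2), (1, 3))`, whose root is an apparent duplication, every bipartition of {1, 2, 3} gives a cost of at least 1, in line with Property 1.1.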

3 Prefixes of gene trees and Minimum Labeled-Edge-Cut

3.1 Minimum Labeled-Edge-Cut.

Given a connected graph $H = (V, E)$, an edge-cut of $H$ is an edge set $E' \subseteq E$ whose removal
disconnects the graph $H$, and the size of an edge-cut $E'$ is the number of edges in $E'$. A bipartition
$B$ on the vertices of $H$ induces a unique edge-cut of $H$, denoted by $E(B)$ and composed of the
edges $(s,t) \in E$ such that $s \in L(v_l)$ and $t \in L(v_r)$. The Minimum Edge-Cut Problem is to find
a bipartition on the vertices of $H$ inducing an edge-cut of $H$ of minimum size. It can be solved in
near-linear time [17].

If the edges of $H$ are labeled on a given alphabet $A$, the label-size of an edge-cut $E'$ of $H$ is the
size of the subset of $A$ defined by the labels of the edges in $E'$. The following problem is a natural
variant of the Minimum Edge-Cut Problem:

Minimum Labeled-Edge-Cut Problem:
Input: A connected edge-labeled graph $H = (V, E)$;

Output: A bipartition $B$ on $V$ such that the label-size of the edge-cut $E(B)$ of $H$ induced by $B$
is minimum.

We now show how to reduce MDBP to a Minimum Labeled-Edge-Cut Problem. Let $F$ be a
gene tree forest with $m$ internal vertices. We label each internal vertex of $F$ using a unique integer of
$A = \{1, \ldots, m\}$: no two internal vertices can have the same label. We then associate to $F$ an
edge-labeled graph $H(F) = (V, E)$ as follows:

- the set $V$ of vertices of $H(F)$ is the set $L(F)$ of labels of leaves of $F$,

- there is an edge between vertices $s$ and $t$ labeled by $a \in A$ if and only if the unique internal
vertex $x$ of $F$ labeled by $a$ is such that $(s,t) \in L(x_l)^2$ or $(s,t) \in L(x_r)^2$.

Lemma 1. Let $F$ be a gene tree forest on $\mathcal{G}$ and $H(F) = (V, E)$ the edge-labeled graph associated
to $F$. If $B$ is a bipartition on $L(F)$, then the cost $d_1(F, B)$ of $B$ is also the label-size of the edge-cut
$E(B)$ of $H(F)$ induced by $B$.

Proof. If $E(B)$ contains an edge $(s,t)$ labeled with $a \in A$, then, from Property 1.2, the vertex of
$F$ labeled by $a$ is a duplication that precedes the first speciation. Conversely, if $a$ is the label of a
duplication that precedes the first speciation, then there is an edge $(s,t)$ in $E(B)$ labeled with $a$. □
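
Lemma 1 can be checked exhaustively on small inputs. The sketch below assumes nested-tuple gene trees with integer leaves, numbers internal vertices in preorder (any unique labeling would do), and stores $H(F)$ as a dictionary from vertex pairs to label sets; these choices and all function names are assumptions made for illustration.

```python
from itertools import combinations


def leaves(t):
    """Leaf-label set L(t) of a nested-tuple tree whose leaves are ints."""
    if isinstance(t, int):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))


def build_h(forest):
    """H(F) as {frozenset({s, t}): set of labels carried by edges s-t}."""
    edges = {}
    next_label = 0

    def visit(x):
        nonlocal next_label
        if isinstance(x, int):
            return
        next_label += 1
        a = next_label
        for child in x:  # pairs inside L(x_l), then pairs inside L(x_r)
            for pair in combinations(sorted(leaves(child)), 2):
                edges.setdefault(frozenset(pair), set()).add(a)
        for child in x:
            visit(child)

    for tree in forest:
        visit(tree)
    return edges


def label_size(edges, left):
    """Number of distinct labels on edges with exactly one endpoint in left."""
    left = frozenset(left)
    cut = set()
    for pair, labels in edges.items():
        if len(pair & left) == 1:
            cut |= labels
    return len(cut)


def d1(forest, left):
    """Duplications preceding the first speciation, via Property 1.2."""
    left = frozenset(left)

    def is_split(ls):
        return bool(ls & left) and bool(ls - left)

    def count(x):
        if isinstance(x, int):
            return 0
        return int(any(is_split(leaves(c)) for c in x)) + sum(count(c) for c in x)

    return sum(count(t) for t in forest)


def lemma1_holds(forest):
    """Compare label-size and d_1 on every proper non-empty side."""
    ground = sorted(frozenset().union(*(leaves(t) for t in forest)))
    h = build_h(forest)
    return all(
        label_size(h, side) == d1(forest, side)
        for r in range(1, len(ground))
        for side in combinations(ground, r)
    )


print(lemma1_holds([((1, 2), (3, 4)), ((1, 3), (2, 4))]))
```

The enumeration is exponential in the number of genomes, so this is a sanity check for toy forests only.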

We conclude this section with some facts on the complexity of the Minimum Labeled-Edge-Cut
Problem. It is naturally linked to another labeled edge-cut problem, the Minimum Label-Cut
problem. Given a connected edge-labeled graph $H = (V, E)$ with edges labeled on a set of labels
$A$, a label-cut of $H$ is a subset $A'$ of $A$ such that the removal of all edges with labels in $A'$ disconnects
$H$, and the size of $A'$ is the number of labels contained in $A'$. The Minimum Label-Cut problem is
then defined as follows:

Minimum Label-Cut problem:
Input: A connected edge-labeled graph $H = (V, E)$;

Output: A label-cut $A'$ of $H$ whose size is minimum.

Lemma 2. An edge-labeled graph has a Minimum Labeled-Edge-Cut of label-size $k$ if and only if
it has a Minimum Label-Cut of size $k$.

Lemma 2 follows directly from the definition of the two problems. It has been shown that
the Minimum Label-Cut Problem is NP-hard when a pair of vertices that should be separated by
the edge-cut induced by $A'$ is given [11]. Lemma 2 implies that the same holds for the Minimum
Labeled-Edge-Cut Problem. Note however that even the hardness of the general Minimum
Labeled-Edge-Cut Problem would not directly imply the hardness of MDBP, as not every graph can be the
graph $H(F)$ obtained from a gene tree forest $F$.

3.2 Prefixes of gene tree forests.

We now describe an alternative point of view on parsimonious first speciations, which does not
consider the graph $H(F)$, but directly the gene trees.

A prefix of a tree is a set $I$ of vertices such that, if $a \in I$, then every ancestor of $a$ belongs to
$I$. From Property 1.3, given a gene tree forest $F$, a bipartition $B$ on $L(F)$, with root $v$, induces
a unique prefix of $F$ composed of the set of duplications in $F$ that precede the first speciation.
Conversely, a prefix $I$ of $F$ induces a partition $P(I)$ of $L(F)$ as follows: two species are in the same
part of $P(I)$ if and only if they belong to the same connected component in the graph obtained
from $H(F)$ by removing all edges corresponding to vertices in the prefix. $|P(I)|$ denotes the number
of parts of $P(I)$. We now introduce an optimization problem related to prefixes of gene tree forests.

Minimum Duplication Prefix Problem (MDPP):
Input: A gene tree forest $F$ on $\mathcal{G}$;

Output: A minimum size prefix $I$ of $F$ such that $|P(I)| \geq 2$.

Lemma 3. Given a gene tree forest $F$ and the edge-labeled graph $H(F)$ associated to $F$, the
minimum label-size of an edge-cut of $H(F)$ is equal to the minimum size of a prefix $I$ of $F$ such that
$|P(I)| \geq 2$.

Proof. An edge-cut $E'$ of $H(F)$ induces a bipartition on $L(F)$ inducing a prefix $I$ of $F$ such that
$|P(I)| \geq 2$ and the size of $I$ is the label-size of $E'$. Conversely, given a prefix $I$ of $F$ such that
$|P(I)| \geq 2$, any bipartition $B = (L_1, L_2)$ of $L(F)$ such that any part of $P(I)$ is included either in
$L_1$ or in $L_2$ induces an edge-cut $E(B)$ of $H(F)$ whose label-size is less than or equal to the size of $I$. □

4 A polynomial 3-approximation via submodular function minimization 

We now show how, by defining from $F$ a graph slightly different from $H(F)$, the Minimum
Labeled-Edge-Cut Problem can be solved, for such graphs, in polynomial time, which gives a 3-approximation
for MDBP.

4.1 Cut-set and submodular function 

A submodular function is a set function $f : 2^V \to \mathbb{R}$ defined from the subsets of a finite set $V$ to the
set of real numbers $\mathbb{R}$ such that for any subsets $A$ and $B$ of $V$, $f(A) + f(B) \geq f(A \cup B) + f(A \cap B)$.
The set $V$ is then called the ground set of $f$.

Several combinatorial optimization problems have been linked to submodular functions [6]. Given a
submodular function $f$, the following optimization problem, which can be solved using combinatorial
polynomial time algorithms [8], is often considered:

Submodular Function Minimization (SFM) Problem:
Input: A submodular function $f : 2^V \to \mathbb{R}$ with ground set $V$;
Output: A subset $V'$ of $V$ such that $f(V')$ is minimum.

The Minimum Edge-Cut problem is a special case of SFM: given a graph $H = (V, E)$, if $g(X)$
denotes the size of the edge-cut $E(B)$ induced by the bipartition $B$ on $V$ such that $L(v_l) = X$ and
$L(v_r) = V - X$, then the cut-set function $g$ is a submodular function. The problem of minimizing
$g$ is then the Minimum Edge-Cut problem on $H$.
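
The submodularity of the ordinary cut-set function can be verified by brute force on a small graph. The sketch below is only a sanity check under assumed names (`is_submodular`, `cut_size`); it tests the defining inequality on every pair of subsets, which is exponential in the size of the ground set.

```python
from itertools import combinations


def subsets(ground):
    """All subsets of the ground set, as frozensets."""
    items = sorted(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground):
    """Check f(A) + f(B) >= f(A | B) + f(A & B) for all subsets A, B."""
    subs = list(subsets(ground))
    return all(f(a) + f(b) >= f(a | b) + f(a & b) for a in subs for b in subs)


def cut_size(edges):
    """Return g, where g(X) is the number of edges with one endpoint in X."""
    def g(x):
        return sum(1 for s, t in edges if (s in x) != (t in x))
    return g


edges = [(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)]
print(is_submodular(cut_size(edges), {1, 2, 3, 4}))
print(is_submodular(lambda x: len(x) ** 2, {1, 2}))  # a non-submodular set function
```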

The Minimum Labeled-Edge-Cut problem can also be reduced to the minimization of a set function:
given an edge-labeled graph $H = (V, E)$, we associate to $H$ the cut-set function $f(H) : 2^V \to \mathbb{R}$,
defined from the subsets of $V$ to $\mathbb{R}$ such that, for any subset $X$ of $V$, $(f(H))(X)$ is the label-size
of the edge-cut $E(B)$ induced by the bipartition $B$ on $V$ such that $L(v_l) = X$ and $L(v_r) = V - X$.
It is then easy to see that solving the Minimum Labeled-Edge-Cut problem on $H$ can be achieved
by minimizing $f(H)$. However, the cut-set function induced by the edge-labeled graph associated
to a gene tree forest $F$ is not always submodular, as we show now.

We now describe a property of the cut-set function for the Minimum Labeled-Edge-Cut Problem
that will be crucial in designing an approximation algorithm for MDBP. Given two subsets $A$ and
$B$ of $L(F)$, we define the following sets of labels:

- $AB$ is the set of labels of edges $(s,t)$ in $H(F)$ such that $s \in A - B$ and $t \in B - A$,

- $C_1$ is the set of labels of edges $(s,t)$ in $H(F)$ such that $s \in A \cap B$ and $t \notin A \cup B$,

- $A_1$ is the set of labels of edges $(s,t)$ in $H(F)$ such that $s \in A - B$ and $t \notin A \cup B$,

- $B_1$ is the set of labels of edges $(s,t)$ in $H(F)$ such that $s \in B - A$ and $t \notin A \cup B$,

- $AC$ is the set of labels of edges $(s,t)$ in $H(F)$ such that $s \in A \cap B$ and $t \in B - A$,

- $BC$ is the set of labels of edges $(s,t)$ in $H(F)$ such that $s \in A \cap B$ and $t \in A - B$.

Lemma 4. If the cut-set function $f = f(H(F))$ induced by $H(F)$ is not a submodular function,
then there exists at least one internal vertex $x$ of $F$ labeled $a$ and two subsets $A$ and $B$ of $L(F)$
such that:

- (1) $a \in A_1 \cap AC$ and $a \notin AB \cup C_1 \cup BC \cup B_1$, or

- (2) $a \in B_1 \cap BC$ and $a \notin AB \cup C_1 \cup AC \cup A_1$,

and $x$ is necessarily a vertex which is not an apparent duplication.

Proof. $f(A) + f(B) = |AB \cup C_1 \cup A_1 \cup AC| + |AB \cup C_1 \cup B_1 \cup BC|$ and
$f(A \cup B) + f(A \cap B) = |C_1 \cup A_1 \cup B_1| + |C_1 \cup AC \cup BC|$.

Any label $a$ of a vertex of $F$ that belongs to more than two or to only one of the six sets ($A_1$,
$B_1$, $C_1$, $AB$, $AC$, $BC$) contributes to $f(A) + f(B)$ with a weight that is greater than or equal to its
contribution to $f(A \cup B) + f(A \cap B)$. If $a$ belongs to exactly two of the sets, then its contribution
$c_1$ to $f(A) + f(B)$ is at least its contribution $c_2$ to $f(A \cup B) + f(A \cap B)$, except when these sets
are $A_1$ and $AC$, or $B_1$ and $BC$. In these cases $c_1 = 1$ and $c_2 = 2$.

The label of a vertex $x$ cannot belong to only two sets if $x$ is an apparent duplication. □
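
A concrete instance of case (1) can be built from the single gene tree ((1,2),(3,4)), whose root is not an apparent duplication. Labeling the root $a$, the graph $H(F)$ has exactly two edges, (1,2) and (3,4), both labeled $a$ (the two other internal vertices have single-leaf children and contribute no edge). The sketch below evaluates the labeled cut-set function on $A = \{1,3\}$ and $B = \{3,4\}$; the example and the name `label_cut` are illustrative assumptions, not taken from the paper.

```python
def label_cut(labeled_edges):
    """Return f, where f(X) is the number of labels on edges cut by X."""
    def f(x):
        return len({a for s, t, a in labeled_edges if (s in x) != (t in x)})
    return f


# H(F) for the gene tree ((1,2),(3,4)) with root labeled "a".
edges = [(1, 2, "a"), (3, 4, "a")]
f = label_cut(edges)

A = frozenset({1, 3})
B = frozenset({3, 4})

# Edge (1,2) has s in A - B and t outside A | B, so "a" is in A_1.
# Edge (3,4) has s in A & B and t in B - A, so "a" is in AC.
lhs = f(A) + f(B)
rhs = f(A | B) + f(A & B)
print(lhs, rhs)   # submodularity would require lhs >= rhs
```

Here the label contributes once to $f(A) + f(B)$ but twice to $f(A \cup B) + f(A \cap B)$, which is the situation $c_1 = 1$, $c_2 = 2$ of the proof.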

4.2 A submodular modification

Lemma 4 suggests a modification of the definition of the edge-labeled graph $H(F)$ associated to a
gene tree forest $F$ such that properties (1) and (2) never hold, which leads to a cut-set function
that is submodular. Given a gene tree forest $F$ on $\mathcal{G}$ with a labeling of its internal vertices using a
set of labels $A$, we now associate to $F$ the graph $I(F) = (V, E)$ defined as follows:

- $V = L(F)$,

- there is an edge $(s,t)$ in $E$, labeled with $a \in A$, if and only if there exists a vertex $x$ in $F$,
labeled by $a$, such that

  • either $(s,t) \in L(x_l)^2$ or $(s,t) \in L(x_r)^2$,

  • or $(s,t) \in L(x)^2$ and $x$ is not an apparent duplication.
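
Lemma 5. Given a gene tree forest $F$, the cut-set function $f(I(F))$ associated to $I(F) = (V, E)$
is a submodular function.

Proof. Properties (1) and (2) of Lemma 4 never hold: by Lemma 4 they can only involve a vertex $x$
that is not an apparent duplication, and in $I(F)$ the edges labeled by such a vertex join every pair of
elements of $L(x)$. If, say, $a \in A_1 \cap AC$, then $L(x)$ contains some $s \in A - B$ and some $t \in B - A$, so
the edge $(s,t)$ puts $a$ in $AB$ as well. □

It follows from Lemma 5 that the minimization of $f(I(F))$ can be solved in polynomial time.
The following theorem shows that it provides a 3-approximation algorithm for MDBP.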

Theorem 1. Given a gene tree forest $F$ with $n$ vertices, on a set of $k$ genomes, and the modified
graph $I(F) = (V, E)$ associated to $F$, if $c$ is the minimum value of the cut-set function $f = f(I(F))$
and $d$ is the minimum cost $d_1(F, B)$ associated to a bipartition $B$ on $V = L(F)$, then $c \leq 2d + 1$.
Moreover, $c$ can be computed in $O(k^{\dots} n \log(kn))$ time.

Proof. Let $d$ be the minimum cost of a bipartition on $V = L(F)$ and let us consider a bipartition
$B$ on $V$ such that $d_1(F, B) = d$. If $X = L(v_l)$, then by definition $f(X)$ is the number of vertices $x$
in $F$ such that there exists a pair $(s,t) \in X \times (V - X)$ such that:

(1) $(s,t) \in L(x_l)^2$ or $(s,t) \in L(x_r)^2$, or

(2) $(s,t) \in L(x)^2$ and $x$ is not an apparent duplication.

Moreover, any vertex $y$ of $F$ that is a strict ancestor of a vertex satisfying (1) or (2) satisfies
(1). Then, the vertices of $F$ satisfying (1) or (2) form a prefix $I$ of $F$ such that only leaves of $I$ can
satisfy (2) without satisfying (1). This implies that if $p$ is the number of vertices of $F$ satisfying (1), then
$f(X) \leq 2p + 1$ (since the number of leaves of a binary tree having $i$ internal vertices is $i + 1$).
Finally, by definition, the cost $d_1(F, B)$ of $B$ is equal to $p$, and then $p = d$, so that
$c \leq f(X) \leq 2d + 1$.

The complexity follows from the algorithm described in [8]. □
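
The inequality of Theorem 1 can be observed on toy instances. The sketch below builds $I(F)$ for a single gene tree and finds both minima by exhaustive enumeration of bipartitions; it does not use a submodular function minimization algorithm such as the one of [8], and it is exponential in the number of genomes. The nested-tuple encoding, the preorder labeling and all function names are assumptions made for illustration.

```python
from itertools import combinations


def leaves(t):
    """Leaf-label set L(t) of a nested-tuple tree whose leaves are ints."""
    if isinstance(t, int):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))


def internal_vertices(t):
    """Internal vertices of t in preorder."""
    if isinstance(t, int):
        return
    yield t
    for child in t:
        yield from internal_vertices(child)


def build_i(tree):
    """Labeled edges (s, t, a) of the modified graph I(F) for one gene tree."""
    edges = []
    for a, x in enumerate(internal_vertices(tree), start=1):
        xl, xr = leaves(x[0]), leaves(x[1])
        groups = [xl, xr]
        if not (xl & xr):              # x is not an apparent duplication
            groups.append(xl | xr)     # add all pairs of L(x)
        for group in groups:
            edges += [(s, t, a) for s, t in combinations(sorted(group), 2)]
    return edges


def label_cut(edges, side):
    """Label-size of the edge-cut induced by side and its complement."""
    return len({a for s, t, a in edges if (s in side) != (t in side)})


def d1(tree, side):
    """Duplications preceding the first speciation, via Property 1.2."""
    def is_split(ls):
        return bool(ls & side) and bool(ls - side)
    return sum(is_split(leaves(v[0])) or is_split(leaves(v[1]))
               for v in internal_vertices(tree))


def proper_sides(ground):
    """All non-empty proper subsets of the ground set."""
    items = sorted(ground)
    for r in range(1, len(items)):
        for combo in combinations(items, r):
            yield frozenset(combo)


def min_values(tree):
    """Return (c, d): brute-force minima of f(I(F)) and of d_1."""
    ground = leaves(tree)
    edges = build_i(tree)
    c = min(label_cut(edges, x) for x in proper_sides(ground))
    d = min(d1(tree, x) for x in proper_sides(ground))
    return c, d


for tree in [((1, 2), (3, 4)), (((1, 2), 3), ((1, 3), 2))]:
    c, d = min_values(tree)
    print(c, d, 2 * d + 1 >= c)
```

The sketch deliberately takes a single gene tree, which is the setting in which the counting argument on the leaves of a binary tree applies directly.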

5 The set of all optimal solutions 

A common approach in the Minimum Edge-Cut based algorithms used for supertree problems is
to seek not only a single parsimonious bipartition on the set of genomes, but all possible ones. The
problem is then to find not one optimal bipartition on the set of species, but a partition compatible
with all optimal bipartitions [16, 13].

Given a gene tree forest $F$, the partition of $L(F)$ compatible with all optimal solutions of MDBP
is such that two elements of $L(F)$ are in the same part if and only if they belong to the same part
in all optimal bipartitions. This partition is denoted by $\mathcal{PB}(F)$. From the point of view of gene
tree prefixes, there can also be several optimal solutions for MDPP. The unique partition of $L(F)$
compatible with all optimal solutions of MDPP is such that two elements of $L(F)$ are in the same
part if and only if they belong to the same part in all partitions induced by optimal prefixes. This
partition is denoted by $\mathcal{PP}(F)$.

Proposition 1. Given a gene tree forest $F$, $\mathcal{PB}(F) = \mathcal{PP}(F)$.

Proof. This result is a straightforward consequence of the definition of $\mathcal{PB}(F)$ and $\mathcal{PP}(F)$, since the
set of edges that belong to a minimum label-size edge-cut of $H(F)$ is exactly the set of edges that
have a label belonging to a minimum size label-cut of $H(F)$. □

For the classical Minimum Edge-Cut problem, the problem of computing the unique partition
of the set of vertices of a graph compatible with all minimum edge-cuts can be reduced to a decision
problem on edges of the graph that can be solved efficiently [17]. We generalize it below, and also
give an equivalent formulation in terms of gene tree prefixes:

Minimum Duplication Bipartition Edge Problem:
Input: An edge-labeled graph $G$ and an edge $e$ of $G$;

Output: Does $e$ belong to a minimum label-size edge-cut of $G$?

Minimum Duplication Prefix Vertex Problem:
Input: A gene tree forest $F$ and a vertex $a$ of $F$;

Output: Does $a$ belong to a minimum size prefix $I$ of $F$ such that $|P(I)| \geq 2$?

Lemma 6. Given a gene tree forest $F$, computing $\mathcal{PB}(F)$ can be reduced to solving the Minimum
Duplication Bipartition Edge Problem for each edge $(s,t)$ of $H(F)$ or the Minimum Duplication
Prefix Vertex Problem for each vertex $a$ of $F$.

Proof. The partition $\mathcal{PB}(F)$ is such that two elements of $L(F)$ are in the same part if and only if
they belong to the same connected component in the graph obtained from $H(F)$ after removing
all edges belonging to a minimum label-size edge-cut of $H(F)$. The equivalence with the Minimum
Duplication Prefix Vertex Problem follows immediately from Proposition 1. □
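
On toy inputs, the partition compatible with all optimal bipartitions can be obtained directly from its definition, by enumerating every bipartition. The sketch below does exactly that; it is exponential in the number of genomes and is not the reduction of Lemma 6. The nested-tuple encoding and the function names are assumptions made for illustration.

```python
from itertools import combinations


def leaves(t):
    """Leaf-label set L(t) of a nested-tuple tree whose leaves are ints."""
    if isinstance(t, int):
        return frozenset([t])
    return frozenset().union(*(leaves(c) for c in t))


def d1(forest, side):
    """Duplications preceding the first speciation, via Property 1.2."""
    def is_split(ls):
        return bool(ls & side) and bool(ls - side)

    def count(x):
        if isinstance(x, int):
            return 0
        return int(any(is_split(leaves(c)) for c in x)) + sum(count(c) for c in x)

    return sum(count(tree) for tree in forest)


def optimal_partition(forest):
    """Return (minimum cost, parts compatible with all optimal bipartitions)."""
    ground = sorted(frozenset().union(*(leaves(t) for t in forest)))
    first, rest = ground[0], ground[1:]
    # Each bipartition is listed once, by the side containing the smallest label.
    sides = [frozenset((first,) + combo)
             for r in range(len(rest))
             for combo in combinations(rest, r)]
    costs = {side: d1(forest, side) for side in sides}
    best = min(costs.values())
    optimal = [side for side in sides if costs[side] == best]
    # Two labels share a part iff they lie on the same side of every optimum.
    parts = {}
    for s in ground:
        parts.setdefault(tuple(s in side for side in optimal), []).append(s)
    return best, sorted(parts.values())


print(optimal_partition([((1, 2), (3, 4))]))
print(optimal_partition([((1, 2), 3), ((1, 3), 2)]))
```

In the second forest two bipartitions are optimal, {1,2} | {3} and {1,3} | {2}, so no two genomes stay together in all optima.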

6 Conclusion 

We showed that computing a parsimonious first speciation in the gene duplication model can be
approximated in polynomial time with a ratio of 3. As far as we know, this is the first time a
constant-ratio approximation algorithm is proposed in relation with the problem of inferring species trees using
gene duplications. This result was obtained by describing the problem in terms of edge-cuts in particular
graphs, which can be computed in polynomial time through submodular function minimization.

The complexity status of the Minimum Duplication Bipartition Problem is still open, but its
relationship to the Minimum Label-Cut Problem, together with the fact that the corresponding cut-set
function is not submodular, seems to indicate it is NP-hard. As for the classical minimum edge-cut, this
question is strongly related to the question of the complexity of deciding if a single edge of a graph
belongs to a minimum labeled-edge-cut, or if an edge label belongs to a minimum label-cut. From
an algorithmic point of view, there is still a large gap between the near-linear time complexity of
the simple Minimum Edge-Cut Problem and the high complexity of our version of the Minimum
Labeled-Edge-Cut Problem. In order to make the approximation algorithm we propose useful in a
practical context, some advances on the complexity of the Minimum Labeled-Edge-Cut Problem
are necessary.

Finally, it is important to remark that all these questions have been described in terms of
edge-cuts in a graph defined by a gene tree forest, but not all graphs can be induced by a gene tree
forest. Hence, for our particular purpose, if one hopes to show that the problems we introduced are
tractable, it is probable that we should use specific properties of these graphs. Alternatively, using
prefixes of gene trees could be a way to attack these questions.

Appendix

Reconciliation. A subtree insertion in a forest $F$ consists in grafting a new subtree onto an existing
branch of $F$. An extension of $F$ is a forest which can be obtained from $F$ by subtree insertions on
the branches of $F$.

A gene tree forest $F$ is said to be DS-consistent with a binary species tree $S$ on $\mathcal{G}$ if, for every
vertex $x$ in $F$ such that $|L(x)| \geq 2$, there exists a vertex $v$ in $S$ such that $L(x) = L(v)$ and one of
the following conditions holds:

(D) either $L(x_l) = L(x_r)$ (indicating a Duplication),

(S) or $L(x_l) = L(v_l)$ and $L(x_r) = L(v_r)$, or inversely (indicating a Speciation).

A reconciliation between a gene tree forest $F$ and a binary species tree $S$ on $\mathcal{G}$ is an extension
$R(F, S)$ of $F$ that is DS-consistent with $S$. A reconciliation between $F$ and $S$ implies an unambiguous
evolution scenario for the gene family $F$ where a vertex of $R(F, S)$ that satisfies property (D)
represents a duplication, and an inserted subtree represents a gene loss. Vertices of $R(F, S)$ that
satisfy property (S) represent speciation events. It is immediate to see that every vertex $x$ of $F$ such
that $L(x_l) \cap L(x_r) \neq \emptyset$ will always be a duplication vertex in any reconciliation $R(F, S)$ between $F$
and $S$.

Optimization problems. The following cost measure is considered for a reconciliation $R(F, S)$
between a gene tree forest $F$ and a species tree $S$: the duplication cost of $R(F, S)$, denoted by
$d(R(F, S), S)$, is the number of duplications induced by $R(F, S)$. When a species tree is not known,
the following natural combinatorial optimization problem is often considered.

Minimum Duplication Problem I:
Input: A gene tree forest $F$ on $\mathcal{G}$;

Output: A binary species tree $S$ such that $d(R(F, S), S)$ is minimum.

Given a gene tree forest $F$ and a species tree $S$ on $\mathcal{G}$, the LCA mapping between $F$ and $S$
induces a reconciliation between $F$ and $S$ where an internal vertex $x$ of $F$ leads to a duplication
vertex if $M(x_l) = M(x)$ and/or $M(x_r) = M(x)$. In [7], it has been shown that the reconciliation
$M(F, S)$ between $F$ and $S$ defined by their LCA mapping minimizes the duplication cost.